A Dynamic Adaptation of AD-trees for Efficient Machine Learning on Large Data Sets

Authors

  • Paul Komarek
  • Andrew W. Moore
Abstract

This paper has no novel learning or statistics: it is concerned with making a wide class of preexisting statistics and learning algorithms computationally tractable when faced with data sets with massive numbers of records or attributes. It briefly reviews the static AD-tree structure of Moore and Lee (1998), and offers a new structure with more attractive properties: (1) the new structure scales better with the number of attributes in the data set; (2) it has zero initial build time; (3) it adaptively caches only statistics relevant to the current task; and (4) it can be used incrementally in cases where new data is frequently being appended to the data set. We provide a careful explanation of the data structure, and then empirically evaluate the performance under varying access patterns induced by different learning algorithms such as association rules, decision trees and Bayes net structures. We conclude by discussing the longer term benefits of the new structure: the eventual ability to apply AD-trees to data sets with real-valued attributes.

1. Description of AD-trees

1.1 What is an AD-tree?

Table 1 shows a tiny data set with M = 3 symbolic (i.e., categorical) attributes (the columns), and R = 6 records (the rows). A counting query has the form C(a1 = 2, a2 = *, a3 = 1), and is a request to count the number of records matching the query, with asterisks interpreted as "don't cares". C(a1 = 2, a2 = *, a3 = 1) = 3 in our example. Moore and Lee (1998) and Anderson and Moore (1998) introduced a new data structure for representing the cached counting statistics for a categorical data set, called an All-Dimensions tree (AD-tree).

Table 1. Sample data set with three attributes and six records.

  ATTRIBUTES:  a1  a2  a3
  RECORD1       1   1   1
  RECORD2       2   3   1
  RECORD3       2   4   2
  RECORD4       1   1   1
  RECORD5       2   3   1
  RECORD6       2   3   1

[Figure: tree diagram with root node a1=*, a2=*, a3=* (count=6), Vary nodes for a1, a2 and a3, and child nodes marked MCV or Null.]
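As an illustration of the counting query described above, the following minimal sketch evaluates a query by scanning the six-record sample data set directly. It is not the AD-tree itself, only the naive baseline whose answers an AD-tree would cache. The names `RECORDS` and `count`, and the use of `None` to stand for the "don't care" asterisk, are assumptions made for this example.

```python
from typing import Optional, Sequence, Tuple

# Sample data set with three attributes (a1, a2, a3) and six records.
RECORDS: Sequence[Tuple[int, int, int]] = [
    (1, 1, 1),  # RECORD1
    (2, 3, 1),  # RECORD2
    (2, 4, 2),  # RECORD3
    (1, 1, 1),  # RECORD4
    (2, 3, 1),  # RECORD5
    (2, 3, 1),  # RECORD6
]


def count(query: Sequence[Optional[int]],
          records: Sequence[Tuple[int, ...]] = RECORDS) -> int:
    """Return the number of records matching the query.

    A query holds one entry per attribute; None is a hypothetical
    stand-in for '*' (don't care) and matches any value.
    """
    return sum(
        all(q is None or q == value for q, value in zip(query, row))
        for row in records
    )


# C(a1 = 2, a2 = *, a3 = 1)
print(count((2, None, 1)))
# C(a1 = *, a2 = *, a3 = *), i.e. every record
print(count((None, None, None)))
```

Each call here rescans all R records and compares up to M attributes, so the cost grows with the size of the data set on every query. Caching the counts, as the AD-tree does, is meant to avoid that repeated scan.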




Publication date: 2000